Abstract—We consider the problem of optimal matching
with queues in dynamic systems and investigate the value-of- 
information. In such systems, the operators match tasks and 
resources stored in queues, with the objective of maximizing the 
system utility of the matching reward profile, minus the average 
matching cost. This problem appears in many practical systems 
and the main challenges are the no-underflow constraints, and 
the lack of matching-reward information and system dynamics 
statistics. We develop two online matching algorithms: Learning- 
aided Reward optimAl Matching (LRAM) and Dual-LRAM (DRAM) to 
effectively resolve both challenges. Both algorithms are equipped 
with a learning module for estimating the matching-reward 
information, while DRAM incorporates an additional module for 
learning the system dynamics. We show that both algorithms 
achieve an $O(\epsilon + \delta_r)$ close-to-optimal utility performance for any
$\epsilon > 0$, while DRAM achieves a faster convergence speed and a
better delay compared to LRAM, i.e., $O(\delta_z/\epsilon + \log(1/\epsilon)^2)$ delay
and $O(\delta_z/\epsilon)$ convergence under DRAM compared to $O(1/\epsilon)$ delay
and convergence under LRAM ($\delta_r$ and $\delta_z$ are maximum estimation
errors for reward and system dynamics). Our results reveal 
that information of different system components can play very 
different roles in algorithm performance and provide a systematic 
way for designing joint learning-control algorithms for dynamic 
systems. 

I. Introduction 

Matching is a fundamental problem that appears in resource
allocation in various systems across different areas, for instance,
network switch scheduling, online advertising, crowdsourcing,
ride sharing, cloud computing, and inventory control. Hence,
efficient matching algorithms are of great importance to system
control.

In this paper, we study the problem of optimal matching
with queues in a dynamic environment with unknown
matching reward statistics. Specifically, we consider a system
that consists of a set of task queues and a set of resource queues,
which store different types of workload and different types 
of resources that come into the system according to some 
random processes. At every time, the system operator decides 
how to match the resources to the pending workload. Each 
matching incurs a cost that depends on the resource allocated 
and random factors in the system, e.g., changing channel conditions
in a downlink system, time-varying prices in inventory
control, or fluctuating payment requirements in crowdsourcing. 
On the other hand, the matching also generates a reward, 
which is random with an unknown distribution determined 
by the amount of tasks resolved and the system condition. 
The objective is to design a matching strategy that carefully 
manages the resources and tasks, so as to achieve optimal 
system utility, which is a function of the achieved reward
profile, subject to the constraint that all tasks are fulfilled
in a timely manner.

This is a general problem and models the aforementioned 
application scenarios. However, it is very challenging to solve. 
First, the system utility is a function of the matching reward, 
which means that it is affected by when and how much 
resource is actually matched to the tasks and is only indirectly 
related to the traffic rates. This differs significantly from 
traditional flow utility optimization problems [7], and
requires both careful admission control to avoid instability 
and appropriate matching to achieve good utility. Second, 
since each matching action is rewarded based on the actual 
amount of tasks resolved, the matching scheme must ensure 
that there are nonzero tasks and nonzero resources in the 
queues, i.e., no-underflow. This constraint is complex and is 
mostly tackled with dynamic programming, which can have 
high computational complexity. Third, the system is dynamic 
and the statistics of system conditions and reward functions are 
unknown beforehand. This requires that the matching scheme
efficiently learn the sufficient statistics of the randomness
and adapt to the changing environment.

In addition to resolving the above challenges, we also take
one step further and investigate the value-of-information
in such matching systems with queues, by explicitly considering
the impact of information on algorithm performance.
Existing works on stochastic system control either focus on
systems with perfect a priori information, or
rely on stochastic approximation techniques that do not require
such information. While the proposed solutions
are effective, they do not capture how information affects algorithm
design and performance, and do not provide interfaces
for integrating the fast-developing “data science” tools, e.g.,
data collecting methods and machine learning algorithms,
into system control.

To provide a rigorous quantification of the value of information,
we first introduce an abstract notion of a learning module,
which represents a general information learning algorithm
and features a learning accuracy level $\delta$ (maximum error),
a learning time $T_\delta$, and the probability of learning accuracy
guarantee $P_\delta$. We then design two online matching algorithms:
Learning-aided Reward optimAl Matching (LRAM) and Dual-
LRAM (DRAM). LRAM utilizes a single $(T_{\delta_r}, \delta_r, P_{\delta_r})$ learning
module for estimating the reward statistics and achieves an
$O(\epsilon + \delta_r)$ system utility, for any $\epsilon > 0$, while ensuring an
$O(1/\epsilon)$ delay bound and an $O(1/\epsilon)$ algorithm convergence
time, defined to be the time taken for the algorithm to enter
the optimal control state. DRAM incorporates an additional
$(T_{\delta_z}, \delta_z, P_{\delta_z})$ learning module for the random system state
distribution and guarantees a similar $O(\epsilon + \delta_r)$ system utility.
Moreover, DRAM is able to achieve an $O(\delta_z/\epsilon + \log(1/\epsilon)^2)$
delay bound and an $O(\delta_z/\epsilon)$ algorithm convergence time,
which can be significantly faster compared to LRAM.

Our results reveal an interesting fact that the reward information
largely determines the utility performance, while
the system dynamics information greatly affects delay and 
algorithm convergence. This indicates that information of 
different system components can have different impacts on 
algorithm performance, and may require different learning 
power for achieving a desired goal. Closest to our paper is
recent work that considers joint learning and control.
Our framework allows much more general learning methods 
and resolves the no-underflow constraints. We also quantify 
the values of different system information. 

We summarize the main contributions as follows: 


1) We propose a matching queueing system model, which 
can model general resource-task matching problems in 
stochastic systems. To explicitly quantify the value of 
information in such systems, we introduce an abstract 
notion of a $(T_\delta, \delta, P_\delta)$-learning module that captures
key characteristics of general learning algorithms and 
provides interfaces for bringing the information learning 
aspect into system control. 

2) We design two learning-aided matching algorithms LRAM
and DRAM. We show that with a single $(T_{\delta_r}, \delta_r, P_{\delta_r})$-
module for learning reward statistics, LRAM achieves an
$O(\epsilon + \delta_r)$ utility, while ensuring an $O(1/\epsilon)$ delay bound
and an $O(1/\epsilon)$ algorithm convergence time. DRAM adopts
an additional $(T_{\delta_z}, \delta_z, P_{\delta_z})$-module for learning the system
state distribution, and guarantees a similar $O(\epsilon + \delta_r)$
system utility, while achieving an $O(\delta_z/\epsilon + \log(1/\epsilon)^2)$
delay and an $O(\delta_z/\epsilon)$ convergence time. We also construct
two $(O(1/\epsilon^c), O(\epsilon^{c/2}), 1 - O(\epsilon^{\log(1/\epsilon)}))$ online
learning modules based on sampling ($c > 0$). Combining
them with DRAM, one achieves a fast $O(1/\epsilon^{1-c/2} + 1/\epsilon^c)$
convergence time with $c < 1$ (existing algorithms
require $\Theta(1/\epsilon)$).

3) Our algorithm design approach provides a low-complexity
way to tackle multiple simultaneous no-underflow
constraints in systems and jointly optimize
utilities that are not defined on flow rates. The development
of DRAM also demonstrates how general learning
algorithms can be combined with queue-based control
(stochastic approximation) to achieve superior delay performance
and accelerate algorithm convergence.


The rest of the paper is organized as follows. We first list
a few motivating examples in Section II. We then present
the matching system model in Section III. The algorithm
design approach and the two algorithms LRAM and DRAM are
presented in Section IV. Analysis is carried out in Section V
and simulation results are presented in Section VI. We then
conclude the paper in Section VII.


II. Motivating Examples 

Crowdsourcing: In a crowdsourcing application, e.g.,
crowdsourcing query search or ride-sharing, tasks of
different types (task) arrive at the server and are assigned
to workers (resource). The workers then carry out the tasks.
Depending on the workers’ qualifications, the types of jobs,
and the instantaneous system condition (state), e.g., whether
a query requestor is in a hurry due to weather, the requestors
receive a certain reward, e.g., satisfaction, and the workers
receive payments. The objective of the system is to design a 
matching scheme, so as to maximize the system utility, which 
is a function of the achieved requestor reward profile. 

Energy Harvesting Networks: In an energy harvesting
network, nodes are responsible for transmitting
data (task) and can harvest energy (resource) from the environment.
At every time, each node decides how much energy
to allocate for transmission and determines traffic scheduling.
Depending on the time-varying channel condition (state), the
amount of energy enables certain processing results. The objective
is to design a joint energy management and scheduling
algorithm, so as to maximize traffic utility and ensure that no 
energy outage happens. 

Online Advertisement: In an online advertising system,
advertisers deposit money (task) into their accounts
at the advertising platform. Queries (resource) for different
keywords arrive in the system and the server decides which
advertiser’s ads to show, based on their relevance to the keywords
and the available budget of the advertisers. Depending
on the chosen ad and the user’s condition (state), e.g., location 
or mood, a business transaction may take place. The goal of the 
system is to design an ad matching scheme, so as to maximize 
the system’s utility, which is a function of the average income 
profile from advertisers. 

Cloud Computing: In a cloud computing platform,
computing resources (resource), e.g., CPU, memory, are
assigned to virtual machine instances (task) for processing 
arriving job requests. The quality of experience of a requestor 
depends on the job completion quality, which is affected by 
system conditions such as background task level (state) and 
the user status. The objective here is to design a resource 
allocation policy, such that the overall quality of service is 
maximized. 

In all these examples, the underlying problem is indeed 
matching with queues. Below, we present the general model. 

III. System Model 

We consider a discrete-time system shown in Fig. 1. In this
system, there are two sets of queues, task queues and resource
queues, and a central server (called operator below), which
coordinates resource allocation and scheduling in the system.
Time is divided into unit-size slots, i.e., $t \in \{0, 1, \ldots\}$.

A. Tasks and Resources 

The task queues store jobs that come into the system and
are waiting to be served by the server. We assume there
are $N$ types of tasks and denote the set of task queues by
$\mathcal{Q} = \{Q_1, \ldots, Q_N\}$. We use $A_n(t)$ to denote the amount
of new tasks arriving at $Q_n$ at time $t$ and assume that
$0 \le A_n(t) \le A_{\max}$. We then define the arrival vector
$A(t) = (A_1(t), \ldots, A_N(t))$. In many systems, arrivals to the
system may not always all be admitted due to congestion
control, e.g., when all servers are busy. We model this by using
$0 \le R_n(t) \le A_n(t)$ to denote the actual admitted traffic to $Q_n$
at time $t$. We then use $Q_n(t)$ to denote the amount of tasks
stored at $Q_n$ at time $t$ and denote $Q(t) = (Q_1(t), \ldots, Q_N(t))$
the task queue vector.

[Fig. 1: Task Queue | Resource Queue]

The resource queues, on the other hand, hold the resources
the system collects over time. There are $M$ types
of system resources and we denote the resource queues by
$\mathcal{H} = \{H_1, \ldots, H_M\}$. We similarly let $e_m(t)$ be the amount
of new resource arriving at $H_m$ with $0 \le e_m(t) \le h_{\max}$.
We also use $H_m(t)$ to denote the amount of resource $m$ the
system currently holds and denote $H(t) = (H_1(t), \ldots, H_M(t))$
the resource queue vector.

In many systems, it is feasible (and sometimes necessary) 
to control the amount of resources in the system, e.g., to avoid 
too many workers waiting in crowdsourcing. We model this 
decision by using $h_m(t) \in [0, e_m(t)]$ to denote the actual
amount of type m resource admitted. For now, it is also 
convenient to temporarily assume that the queues are all of 
unlimited sizes. We will later show that our algorithms ensure 
that finite buffer sizes are sufficient. 

B. System State and Resource Allocation

We assume that the system has a time-varying condition,
e.g., the channel conditions in a downlink system, or the expected
happiness measures of human users in a crowdsourcing
system. We call this condition the system state and model it
by a random state variable $\omega(t)$. Note that $\omega(t)$ represents the
aggregate system condition.

Denote $z(t) = (A(t), e(t), \omega(t))$. In this paper, we assume
that $z(t)$ is i.i.d. and takes values in $\mathcal{Z} = \{z_1, \ldots, z_K\}$. We
then denote $\pi_k = \Pr\{z(t) = z_k\}$. Note that this allows
arbitrary dependency among $A(t)$, $e(t)$, and $\omega(t)$.

At every time $t$, the system operator determines the amount
of resource to allocate to serving each queue. We denote this
decision by a matching matrix $b(t) = (b_{mn}(t), \forall\, m, n)$, where
$b_{mn}(t)$ denotes the type $m$ resource allocated to queue $n$.
When $z(t) = z_k$, $b(t)$ takes values from a finite discrete set
$\mathcal{B}_k \subset \mathbb{R}_+^{M \times N}$.$^1$ We define $b_{\max} = \max_{k,\, b \in \mathcal{B}_k} \|b\|_\infty$ the maximum
amount of resource allocated to any queue at any time. It is
clear that at any time $t$, we must have:

$$\sum_n b_{mn}(t) \le H_m(t), \quad \forall\, m. \tag{1}$$

1 This assumption is made to simplify the learning algorithm description
(Section IV-B). Our results can likely be extended to the case when $\{\mathcal{B}_k\}$
are general compact sets in $\mathbb{R}_+^{M \times N}$.


This is because one cannot spend more resource than what
is available. In the following, we call (1) the no-underflow
constraint. Depending on the system state and the resource
allocation decision, each task queue gets a service rate $\mu_n(t) =
\mu_n(z(t), b(t))$. We assume $\mu_n(z(t), b(t)) \in [0, \mu_{\max}]$ for all
$z(t)$ and $b(t)$ and that $\{\mu_n(z(t), b(t))\}_{n=1}^{N}$ are known to the
operator. Also, they satisfy that $\mu_n(z(t), 0) = 0$ for all $z(t)$,
and if $\mu_n(z(t), b) > 0$, then

$$\mu_n(z(t), b) \ge \beta_l \min_{m:\, b_{mn} > 0} b_{mn}, \tag{2}$$

for some $\beta_l > 0$. Moreover, if two vectors $b$ and $b'$ are such
that $b'$ is obtained by setting $b_{mn}$ in $b$ to zero, then,

$$\mu_n(z, b) \le \mu_n(z, b') + \beta_u b_{mn}, \quad \forall\, n. \tag{3}$$

Note that (2) and (3) are not restrictive. They simply require
that nonzero resource is needed for getting a positive service 
rate, and that a positive rate is upper and lower bounded by 
linear functions of the resources allocated. 
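
To make conditions (2) and (3) concrete, the following minimal sketch checks them numerically for one illustrative service function. The linear form `mu_linear` (a scalar gain times the total resource given to queue n) and the names `gain`, `beta_l`, and `beta_u` are assumptions chosen for illustration; they are not taken from the model above. For this linear choice both conditions hold with `beta_l = beta_u = gain`.

```python
def mu_linear(gain, b, n):
    """Hypothetical service rate: gain times the total resource given to queue n.

    b is an M x N nested list, b[m][n] = type-m resource given to queue n.
    """
    return gain * sum(row[n] for row in b)


def check_conditions(gain, b, n, beta_l, beta_u):
    """Return True if conditions (2) and (3) hold for queue n under allocation b."""
    rate = mu_linear(gain, b, n)
    positive = [row[n] for row in b if row[n] > 0]
    # (2): a positive rate is at least beta_l times the smallest positive allocation.
    lower_ok = rate == 0 or rate >= beta_l * min(positive)
    # (3): zeroing one entry b[m][n] lowers the rate by at most beta_u * b[m][n].
    upper_ok = True
    for m in range(len(b)):
        b_prime = [list(row) for row in b]
        b_prime[m][n] = 0
        if rate > mu_linear(gain, b_prime, n) + beta_u * b[m][n] + 1e-12:
            upper_ok = False
    return lower_ok and upper_ok
```

A too-small `beta_u` makes the check fail, which is one way to see that (3) bounds how much a single allocation entry can contribute to the service rate.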


C. Matching Cost and Reward 

In every time slot, due to resource expenditure, there is a
matching cost associated with the resource allocation decision.
We model this by denoting $c(t) = c(z(t), b(t))$ the cost for
choosing the resource vector $b(t)$. This cost can represent, e.g.,
cost for purchasing raw materials in inventory control, or payments
to workers in a crowdsourcing application. One example
is $c(z(t), b(t)) = \sum_{n,m} c_m(z(t)) b_{mn}(t)$, where $c_m(z(t))$
denotes the per-unit resource price for type $m$ resource under
state $z(t)$. We assume that $c(z(t), b(t)) \in [0, c_{\max}]$ for all time
and it is known to the system operator. Also, if $0 \preceq b_1 \preceq b_2$
(entrywise), then $c(z(t), b_1) \le c(z(t), b_2)$.

Every time a matching is completed, the operator collects
a matching reward, e.g., a customer conversion due
to an ad, or user satisfaction due to job completion. We
model this by denoting the reward collected at time $t$
from type $n$ tasks by $\kappa_n(t)$. We assume that $\kappa_n(t) \in
[0, r_{\max}]$ is an i.i.d. random variable given $z(t)$ and
$b(t)$, and its mean is determined by the reward function
$r_n(z(t), \tilde{\mu}_n(t)) = E\{\kappa_n(t) \mid z(t), b(t), Q(t)\}$, where
$\tilde{\mu}_n(t) = \min[Q_n(t), \mu_n(t)]$ denotes the actual amount of tasks
completed. We assume that $r_n(z(t), \mu)$ satisfies:

$$r_n(z(t), \mu) \le r_n(z(t), \mu'), \quad \text{if } \mu \le \mu', \tag{4}$$

and denote $r = \{r_n(z, \mu_n(z, b^z)),\ z \in \mathcal{Z},\ b^z \in \mathcal{B}_z\}$ the
reward matrix. Since each $\mathcal{B}_z$ is finite, $r$ is also finite.

Different from existing works, we do not
assume any prior knowledge of the functions $r(z(t), \mu)$.$^2$ This
is quite common in practice. For example, in crowdsourcing
applications, it is often unknown a priori how qualified a
worker is for a certain type of tasks; or in online advertising,
one often does not know the conversion probabilities beforehand.

2 This is different from the $\mu$ functions, which measure how much resources
are spent and can typically be observed by the system controller.

D. Queueing

From the above, we see that the queue vectors $Q(t)$ and
$H(t)$ evolve according to:

$$Q_n(t+1) = \max[Q_n(t) - \mu_n(t), 0] + R_n(t), \quad \forall\, n, \tag{5}$$

$$H_m(t+1) = H_m(t) - \sum_n b_{mn}(t) + h_m(t), \quad \forall\, m. \tag{6}$$

Notice that there is no $\max[\cdot, \cdot]$ operator in (6). This is due
to the no-underflow constraint (1). In our paper, we say that
a queue vector process $x(t) \in \mathbb{R}^d$ is stable if it satisfies:$^3$

$$x_{av} \triangleq \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{n=1}^{d} E\{x_n(\tau)\} < \infty. \tag{7}$$
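
The queue dynamics (5) and (6) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the list-based data layout are hypothetical, and the time average is a finite-horizon stand-in for the limit in (7).

```python
def step_queues(Q, H, mu, R, b, h):
    """One slot of the task-queue update (5) and the resource-queue update (6).

    Q, mu, R: lists of length N (backlog, service rate, admitted tasks).
    H, h: lists of length M (stored resource, admitted resource).
    b: M x N nested list, b[m][n] = type-m resource given to queue n.
    """
    # No-underflow constraint (1): never spend more than what is stored.
    for m, row in enumerate(b):
        if sum(row) > H[m]:
            raise ValueError(f"resource {m} would underflow")
    Q_next = [max(q - s, 0) + r for q, s, r in zip(Q, mu, R)]
    # No max[.,.] is needed here because (1) was enforced above.
    H_next = [H[m] - sum(b[m]) + h[m] for m in range(len(H))]
    return Q_next, H_next


def time_average_backlog(history):
    """Finite-horizon analogue of (7): average total backlog per slot."""
    return sum(sum(x) for x in history) / len(history)
```

For example, `step_queues([3, 1], [4], [2, 2], [1, 0], [[1, 2]], [1])` serves both task queues, spends three units of the single resource, and admits one new unit of it.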

E. Utility Optimization

The system’s utility is determined by a function of the
average matching reward profile. Specifically, define $\bar{r}_n =
\lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E\{r_n(\tau)\}$. The system utility is given by:

$$U_{total}(\bar{r}) = \sum_n U_n(\bar{r}_n). \tag{8}$$

Here each $U_n(r)$ is an increasing concave function with
$U_n(0) = 0$ and $U_n'(0) < \infty$. We denote $\beta = \max_{n,\, r} U_n'(r)$
the maximum first derivative of the utility functions. We also
define the following system cost due to resource expenditure:

$$C_{total} = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E\{c(\tau)\}. \tag{9}$$

We say that a matching algorithm $\Pi$ is feasible if for all time
$t$, it selects $0 \preceq R(t) \preceq A(t)$, $0 \preceq h(t) \preceq e(t)$, and $b(t) \in
\mathcal{B}_{z(t)}$, and it ensures constraint (1) for all time. Our objective
is to design a feasible policy $\Pi$, so as to:

$$\max: \ f_{av} \triangleq U_{total}(\bar{r}) - C_{total} \tag{10}$$

$$\text{s.t.} \ \ Q_{av} < \infty, \ H_{av} < \infty. \tag{11}$$

We denote the optimal solution value as $f^*_{av}$. Here the queue
stability constraints are to ensure that the tasks and resources 
do not stay in the queue forever. This is important in many 
cases. For instance, in an energy harvesting network, it is important
to ensure timely packet delivery, or in a crowdsourcing
system, it is desirable to keep the worker waiting time short. 
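
As a small illustration of the objective, the sketch below evaluates a finite-horizon version of the utility-minus-cost value from a recorded trajectory. Note that each utility is applied to the time-averaged reward, not averaged over per-slot utilities. The function name, the data layout, and the logarithmic utility are assumptions for illustration only.

```python
import math


def empirical_objective(rewards, costs, utilities):
    """Finite-horizon estimate of the utility of average rewards minus average cost.

    rewards[t][n]: reward collected from type-n tasks in slot t.
    costs[t]: matching cost paid in slot t.
    utilities[n]: increasing concave function U_n with U_n(0) = 0.
    """
    T = len(rewards)
    N = len(rewards[0])
    avg_reward = [sum(rewards[t][n] for t in range(T)) / T for n in range(N)]
    avg_cost = sum(costs) / T
    total_utility = sum(U(r) for U, r in zip(utilities, avg_reward))
    return total_utility - avg_cost


def log_utility(r):
    """Example utility: increasing, concave, U(0) = 0, and U'(0) = 1."""
    return math.log(1.0 + r)
```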

F. Discussion on the Model 

Due to the general matching reward function, our problem
is different from a flow utility maximization problem,
which is a special case when $r_n(t) = \mu_n(t)$. By
tuning its parameters, our model can capture
all the examples in Section II. For example, by choosing
$U_{total}$ appropriately, the system models the revenue maximization
problem in online advertisement systems. By choosing
$\mu_n(t) = 1_{[b_1(t) \ge b_{\min}]} 1_{[b_2(t) \ge b_{\min}]}$, our model can represent a
cloud computing system, where a computing task requires two
types of resources.

Problem (10) is very challenging. First of all, the no-underflow
constraint (1) requires a very careful selection of
control actions, because actions in a slot can affect action feasibility
in later slots. Problems of this kind are often tackled with
dynamic programming, whose computational complexity can
be extremely high when the action space is large. Secondly, the
reward function $r_n(z(t), \tilde{\mu}_n(t))$ is unknown and is dependent
on $Q_n(t)$. This makes the problem very different from existing
utility maximization works. Thirdly, due to
increasingly stringent user requirements on service quality,
it is desirable to ensure small queueing delay.

3 In this paper, we assume that all limits exist with probability 1. Our results
can be extended to more general cases with lim inf or lim sup arguments.

IV. Optimal Matching 

In this section, we present our matching algorithms. We will
first present an ideal algorithm, which assumes full knowledge
of the reward functions $r_n(z(t), \mu)$ and will serve as a basic
building block. Even in this case, we will see that the problem
is highly nontrivial due to the existence of the no-underflow
constraint (1) and the dependency of $r_n(t)$ on $Q(t)$.

A. With Full Reward Information 

To start, we first introduce an auxiliary variable $\gamma_n(t) \in
[0, r_{\max}]$ and create for each $n$ a deficit queue $d_n(t)$ that
evolves as follows:

$$d_n(t+1) = \max[d_n(t) - \kappa_n(t), 0] + \gamma_n(t), \tag{12}$$

with $d(0) = 0$. Note that the input into $d_n(t)$ is $\kappa_n(t)$ instead
of $r_n(t)$. The deficit queue $d_n(t)$ measures how much the
actual reward profile is currently lagging behind the target
value (due to randomness). 
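
The deficit-queue recursion can be sketched as follows, reading (12) as subtracting the realized reward and adding the auxiliary target for each task type. The function name and the list layout are hypothetical.

```python
def update_deficit(d, kappa, gamma):
    """One slot of the deficit queue: d_n <- max(d_n - kappa_n, 0) + gamma_n.

    d: current deficits, kappa: realized rewards, gamma: auxiliary targets.
    """
    return [max(dn - kn, 0.0) + gn for dn, kn, gn in zip(d, kappa, gamma)]
```

If the realized reward keeps up with the target, the deficit stays small; a growing deficit signals that the reward actually collected is falling behind the target.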

Then, we denote $f(t) = \sum_n U_n(\gamma_n(t)) - c(t)$ the instantaneous
system utility minus cost and denote $y(t) =
(Q(t), H(t), d(t))$. We also define a Lyapunov function as
follows:

$$L(t) = \frac{1}{2}\|Q(t) - \boldsymbol{\theta}_1\|^2 + \frac{1}{2}\|H(t) - \boldsymbol{\theta}_2\|^2 + \frac{1}{2}\|d(t)\|^2, \tag{13}$$

where $\|\cdot\|$ is the Euclidean norm and $\boldsymbol{\theta}_1 = \theta_1 \cdot 1_N$ and
$\boldsymbol{\theta}_2 = \theta_2 \cdot 1_M$ with $1_k \in \mathbb{R}^k$ being the vector with all
components being 1, and $\theta_1$ and $\theta_2$ are constants that will be
specified later. We define the one-slot utility-based conditional
Lyapunov drift $\Delta_V(t) = E\{L(t+1) - L(t) - V f(t) \mid y(t)\}$.
Using the queueing dynamics (5), (6), and (12), we obtain
the following lemma, in which $V \ge 1$ is a tunable parameter
introduced for controlling the tradeoff between system utility
and service delay (explained later).

Lemma 1: Under any feasible policy, we have:

$$\begin{aligned}
\Delta_V(t) \le{} & G - \sum_n E\{V U_n(\gamma_n(t)) - d_n(t)\gamma_n(t) \mid y(t)\} \\
& + \sum_n (Q_n(t) - \theta_1) E\{R_n(t) \mid y(t)\} \\
& + \sum_m (H_m(t) - \theta_2) E\{h_m(t) \mid y(t)\} \\
& + E\Big\{V c(t) - \sum_m (H_m(t) - \theta_2) \sum_n b_{mn}(t) \\
& \qquad - \sum_n (Q_n(t) - \theta_1)\mu_n(t) - \sum_n d_n(t) r_n(t) \,\Big|\, y(t)\Big\}.
\end{aligned} \tag{14}$$

Here $G = N(A_{\max}^2 + \mu_{\max}^2 + 2 r_{\max}^2) + M h_{\max}^2 + M N^2 b_{\max}^2$
does not depend on $V$, and the expectations are taken over the
randomness in the system as well as in the policy. $\Diamond$

Proof: See Appendix A. ■

We now construct our algorithm by minimizing the right-hand-side
(RHS) of the drift (14).

Reward optimAl Matching (RAM): At every time $t$, observe
$z(t)$ and $y(t)$. Do:

1) Quota: For each $n$, choose $\gamma_n(t)$ by solving:

$$\max: \ V U_n(\gamma_n(t)) - d_n(t)\gamma_n(t), \quad \text{s.t.} \ \ 0 \le \gamma_n(t) \le r_{\max}. \tag{15}$$

2) Admission: For each $n$, if $Q_n(t) < \theta_1$, let $R_n(t) =
A_n(t)$; otherwise $R_n(t) = 0$. Similarly, for each $m$, if
$H_m(t) < \theta_2$, let $h_m(t) = e_m(t)$; otherwise $h_m(t) = 0$.

3) Resource: Choose the resource allocation vector $b(t)$ by
solving:

$$\min: \ \Phi_r(b) \triangleq V c(z(t), b) - \sum_m (H_m(t) - \theta_2) \sum_n b_{mn} - \sum_n (Q_n(t) - \theta_1)\mu_n(z(t), b) - \sum_n d_n(t) r_n(z(t), \mu_n(z(t), b)) \tag{16}$$

$$\text{s.t.} \ \ b \in \mathcal{B}_{z(t)}, \ \text{Constraint (1)}. \tag{17}$$

4) Queueing: Update $Q(t)$, $H(t)$, and $d(t)$, according to
(5), (6), and (12), respectively. $\Diamond$

Note that in (16) we have used $r_n(z(t), \mu_n(t))$ instead of
$r_n(z(t), \tilde{\mu}_n(t))$. We will see in our later analysis that our
algorithm automatically guarantees $\tilde{\mu}_n(t) = \mu_n(t)$. This is
very useful, for otherwise the algorithm performance will
be very hard to analyze. We also emphasize here that the
introduction of $\theta_1$ and $\theta_2$ is important. It can be seen in
the admission step that if $\theta_1 = \theta_2 = 0$, i.e., without $\theta_1$
and $\theta_2$, no task or resource will be admitted in the first place
and the algorithm will not even proceed!


B. With Reward Information Learning

Here we consider the case when one does not have full
reward information and provide an algorithm that can integrate
general learning methods for estimating $r$.

To also investigate the impact of learning on algorithm
design and performance, we first define learning capability.
Specifically, for any general matrix $W$ and a learning algorithm
$\Gamma$ that outputs an estimation $\hat{W}$, we denote its maximum
estimation error by:

$$\delta_W = \|W - \hat{W}\|_{\max}, \tag{18}$$

where $\|x\|_{\max} = \max_{i,j} |x_{ij}|$. Then, the formal definition of
a learning module is as follows.

Definition 1: An algorithm $\Gamma$ is called a $(T_\delta, P_\delta, \delta)$-learning
module, if (i) it completes learning in $T_\delta$ time, (ii)
it guarantees $\Pr\{\delta_W \le \delta\} \ge P_\delta$, and (iii) for any $T > 0$, $P_\delta$
does not decrease if the algorithm is run for $T_\delta + T$ time. $\Diamond$

Here $T_\delta$ can be either random or deterministic depending on
the termination rules. This definition is general and captures
key features of learning algorithms. With this definition,
having perfect knowledge at the beginning can be viewed as
having a $(0, 1, 0)$-learning module.

We now present an optimal matching algorithm for general
systems that do not possess perfect knowledge of $r$ and need
to rely on some learning algorithms for estimation. In the
algorithm, we use $\beta_{\hat{r}}$ to denote the maximum “derivative”
of the estimated $\hat{r}$ with respect to any $b_{mn}$. Specifically, we
assume that if $b$ and $b'$ are such that $b'$ is obtained by setting
one $b_{mn}$ in $b$ to zero, then,

$$\hat{r}_n(z, \mu_n(z, b)) \le \hat{r}_n(z, \mu_n(z, b')) + \beta_{\hat{r}} b_{mn}, \quad \forall\, n. \tag{19}$$

Since both $\mathcal{Z}$ and $\{\mathcal{B}_z\}$ are finite, we see that $\beta_{\hat{r}}$ exists and
is $O(1)$ (possibly depends on $\hat{r}$).

Learning-aided Reward optimAl Matching (LRAM):

1) (Learning) Apply any $(T_{\delta_r}, P_{\delta_r}, \delta_r)$-learning module
$\Gamma_r$. Terminate at $t = T_{\delta_r}$ and output $\hat{r}$.

2) (Matching) Set $Q(T_{\delta_r} + 1) = 0$, $H(T_{\delta_r} + 1) = 0$, and
$d(T_{\delta_r} + 1) = 0$. Choose $\theta_1$ and $\theta_2$ according to:

0\ — (tax + (V ft + ^max )Pr)/Pl + Mmax (20) 

02 = (V + fmax)/3f + f m ax/3^ + Nb mSiX . (21) 

Run RAM with r. O 

In LRAM, we explicitly separate the algorithm into two disjoint
phases. This is chosen to facilitate presentation and analysis.
Doing so also does not change the order of the overall
algorithm convergence time and performance. We can also
transform LRAM and DRAM below into continuous-learning versions
and update estimations from time to time, e.g.,
using sliding-window estimation or frame-based estimation.
It is also worth noting that $\theta_1$ and $\theta_2$ can be computed
beforehand easily. This is a feature useful for implementation. 
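
The per-slot decisions that the matching phase executes (quota, admission, and resource allocation) can be sketched as below. This is a minimal illustration under assumptions, not the authors' implementation: the utility is assumed to be U(g) = log(1 + g) so that the quota step has a closed form, the resource step is a brute-force search over a caller-supplied finite candidate set, and `cost`, `mu`, and `r_hat` are hypothetical callables supplied by the user (with `r_hat` standing for the learned reward estimate).

```python
import math


def quota(V, d_n, r_max):
    """Quota step for the assumed utility U(g) = log(1 + g).

    Maximizes V * log(1 + g) - d_n * g over 0 <= g <= r_max.
    """
    if d_n <= 0:
        return r_max  # objective is increasing in g
    return min(max(V / d_n - 1.0, 0.0), r_max)


def admit(backlog, arrival, theta):
    """Admission step: admit everything below the threshold, nothing otherwise."""
    return arrival if backlog < theta else 0.0


def allocate(candidates, z, Q, H, d, V, theta1, theta2, cost, mu, r_hat):
    """Resource step: brute-force search of the weighted objective over candidates.

    Each candidate b is an M x N nested list. Candidates that would spend more
    than the stored resource are skipped. Returns None if no candidate is feasible.
    """
    best_b, best_val = None, math.inf
    for b in candidates:
        spent = [sum(row) for row in b]
        if any(s > H_m for s, H_m in zip(spent, H)):
            continue  # violates the no-underflow constraint
        val = V * cost(z, b)
        val -= sum((H[m] - theta2) * spent[m] for m in range(len(H)))
        for n in range(len(Q)):
            rate = mu(z, b, n)
            val -= (Q[n] - theta1) * rate + d[n] * r_hat(z, rate, n)
        if val < best_val:
            best_b, best_val = b, val
    return best_b
```

Including the all-zero matrix among the candidates guarantees that a feasible choice always exists. The exhaustive search is only practical for small action sets; it is used here to keep the sketch short.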


C. With System State Information Learning 


In the previous section, we describe how the estimated reward
information $\hat{r}$ can be naturally integrated into a matching
algorithm. Here we consider the case when a learning module
is also applied to learning the statistics of the system state
$z(t)$. Our result here generalizes the previously proposed
dual-learning approach to handle underflow and to allow general
learning methods.

To start, we define the following optimization problem:

$$\max: \ \Phi \triangleq V\Big[\sum_n U_n(\gamma_n) - \text{Cost}\Big] \tag{22}$$

$$\text{s.t.} \ \ \gamma_n \le \bar{r}_n \triangleq \sum_k \pi_k r_n(z_k, \mu_n(z_k, b^k)) \tag{23}$$

$$\text{Cost} = \sum_k \pi_k c(z_k, b^k) \tag{24}$$

$$\sum_k \pi_k R_n^k = \sum_k \pi_k \mu_n(z_k, b^k), \quad \forall\, n \tag{25}$$

$$\sum_k \pi_k \sum_n b_{mn}^k = \sum_k \pi_k h_m^k, \quad \forall\, m \tag{26}$$

$$0 \le \gamma_n \le r_{\max}, \ b^k \in \mathcal{B}_k \tag{27}$$

$$0 \preceq R^k \preceq A^k, \ 0 \preceq h^k \preceq e^k. \tag{28}$$

Problem (22) can intuitively be viewed as a way to solve our
matching problem. The equalities in (25) and (26) are due to the
fact that only tasks that are actually served generate reward and
the no-underflow constraint (1). However, a scheme obtained
by solving (22) may not be implementable because (i) it
ignores the no-underflow constraint, and (ii) it assumes that
all resources allocated are fully utilized, i.e., using $\mu_n(z_k, b^k)$
in (23). We will also see later that it requires a much larger
learning time for such a scheme to have the right statistics for
achieving a performance comparable to ours.


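We now obtain the dual problem for (22) as follows.

$$\min: \ g(\alpha^d, \alpha^q, \alpha^h) \quad \text{s.t.} \ \ \alpha^d \succeq 0, \ \alpha^q \in \mathbb{R}^N, \ \alpha^h \in \mathbb{R}^M, \tag{29}$$

where $g(\alpha^d, \alpha^q, \alpha^h) = \sum_k \pi_k g_k(\alpha^d, \alpha^q, \alpha^h)$ is the dual
function, and $g_k(\alpha^d, \alpha^q, \alpha^h)$ is defined as:

$$\begin{aligned}
g_k(\alpha^d, \alpha^q, \alpha^h) = \sup_{R, \gamma, b, h} \Big\{ & V\Big[\sum_n U_n(\gamma_n) - c(z_k, b)\Big] \\
& - \sum_n \alpha_n^d \big[\gamma_n - r_n(z_k, \mu_n(z_k, b))\big] \\
& - \sum_n \alpha_n^q \big[R_n - \mu_n(z_k, b)\big] - \sum_m \alpha_m^h \Big[h_m - \sum_n b_{mn}\Big] \Big\}.
\end{aligned} \tag{30}$$

Note that $g_k(\alpha^d, \alpha^q, \alpha^h)$ is indeed the dual function for state
$z_k$. With (29), we now present our algorithm, which integrates
system state information learning into control.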

Dual learning-aided Reward optimAl Matching (DRAM):

1) (Learning) Apply any $(T_{\delta_r}, P_{\delta_r}, \delta_r)$-learning module $\Gamma_r$
for $r$ and any $(T_{\delta_z}, P_{\delta_z}, \delta_z)$-learning module $\Gamma_z$ for $z$.
Terminate at $T_L = \max(T_{\delta_r}, T_{\delta_z})$ and output $\hat{r}$ and $\hat{\pi}$.
Choose $\theta_1$ and $\theta_2$ according to:

9\ = (iimax + (V ft + ^ max )/3 r)/fi l fi + Ahnax (31) 

9-2 = {yP + r ma x)/3f + fmax/3^ + AT max . (32) 

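2) (Dual learning) Solve the empirical dual problem:

$$\min: \ \sum_k \hat{\pi}_k \hat{g}_k(\alpha^d, \alpha^q - \theta_1, \alpha^h - \theta_2) \tag{33}$$

$$\text{s.t.} \ \ \alpha^d \succeq 0, \ \alpha^q \in \mathbb{R}^N, \ \alpha^h \in \mathbb{R}^M.$$

Here $\hat{g}_k(\alpha^d, \alpha^q - \theta_1, \alpha^h - \theta_2)$ is defined in (30) with $\hat{r}$.
Denote the optimal solution as $\alpha^* = (\alpha^{d*}, \alpha^{q*}, \alpha^{h*})$.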

3) (Matching) Set $Q(T_L + 1) = 0$, $H(T_L + 1) = 0$, and
$d(T_L + 1) = 0$. For all $t \ge T_L + 1$, define:

$$\hat{Q}(t) = Q(t) + \alpha^{q*} - C_N \tag{34}$$

$$\hat{H}(t) = H(t) + \alpha^{h*} - C_M \tag{35}$$

$$\hat{d}(t) = d(t) + \alpha^{d*} - C_N, \tag{36}$$

where $C_k$ is a vector in $\mathbb{R}^k$ with all elements being $C \triangleq
2\max(\delta_z V \log(V)^2, \log(V)^2)$. Run RAM with $\hat{r}$, $\hat{Q}(t)$,
$\hat{H}(t)$, and $\hat{d}(t)$. If the resulting $b(t)$ from (16) violates
(1) for some $m$, change $\{b_{mn}(t)\}_{n=1}^{N}$ to $\{\tilde{b}_{mn}(t)\}_{n=1}^{N}$
with $\sum_n \tilde{b}_{mn}(t) = H_m(t)$ and drop $\mu_n(z(t), b(t))$ tasks
from each $n$ that has $b_{mn}(t) > 0$.$^4$ $\Diamond$

DRAM first utilizes learning to obtain an empirical distribution,
which is usually crude but fast at the beginning.
Then, it transforms to queue-based control (or more generally,
stochastic approximation), by obtaining an empirical optimal
multiplier via dual learning. It then starts from the empirical
multiplier and relies on queue-based control to learn the true
optimal operation point. The procedure is shown in Fig. 2. This
combination avoids the slow convergence regime of statistical 
learning and the slow start of stochastic approximation, and 
algorithms so developed can achieve much faster convergence 
and superior delay. 
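
The queue shift in the matching step of DRAM can be sketched as follows. The sketch assumes the offset constant reads $C = 2\max(\delta_z V \log(V)^2, \log(V)^2)$ and that each effective queue is the actual queue plus the empirical optimal multiplier minus $C$; both readings are our interpretation of the algorithm description, and the function names are hypothetical.

```python
import math


def offset_constant(delta_z, V):
    """Assumed form of the offset: C = 2 * max(delta_z * V * log(V)^2, log(V)^2)."""
    log_sq = math.log(V) ** 2
    return 2.0 * max(delta_z * V * log_sq, log_sq)


def effective_queues(Q, H, d, alpha_q, alpha_h, alpha_d, C):
    """Shift each queue by its empirical multiplier minus the constant C."""
    def shift(x, a):
        return [xi + ai - C for xi, ai in zip(x, a)]
    return shift(Q, alpha_q), shift(H, alpha_h), shift(d, alpha_d)
```

The control decisions are then computed from the shifted values while the actual queues start from zero, which is how the learned multiplier gives the queue-based controller a head start.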


4 In actual implementation, one can still serve the tasks with the actual 
allocated resource. 


[Figure 2: timeline of statistical learning (fast and crude, then slow and accurate), dual learning, and stochastic approximation (linear).]

Fig. 2. DRAM combines the fast regime of statistical learning and the smoothness
of queue-based control (more generally, stochastic approximation).


Also note here that the dropping step is introduced to ensure 
zero underflow, by giving up the service rates and reward at 
that particular slot. We will see in later analysis and simulation 
that such an event almost never happens and hence does not 
affect performance. 


D. Sampling-based Learning Module 


Here we describe two sampling-based learning modules
for estimating $r$ and $\pi$. We first describe a threshold-based
sampling module for estimating $r$. In the module, we use
$s(z, b, t)$ to denote the number of times the pair $(z, b)$ is
sampled, i.e., $b$ is adopted when $z(t) = z$, up to time $t$. We also
denote $s_{\min}(t) = \min_{(z,b)} s(z, b, t)$.

Threshold-Based Sampling TBS($s_{th}$): At every time $t$, sample
the resource allocation vector $b^* \in \arg\min_b s(z(t), b, t)$
until $s_{\min}(t) \geq s_{th}$. If the module terminates at $T_{tbs}$, output

$\hat{r}_n(z, b) = \sum_{t=1}^{T_{tbs}} 1_{[z(t)=z,\, b(t)=b]}\, \kappa_n(t) / s(z, b, T_{tbs})$. ◇

Here $1_{[x]}$ is the indicator function of $x$. The learning module
TBS($s_{th}$) is very intuitive. It tries to balance the sampling
frequencies of all $(z, b)$ until every pair is sampled at least $s_{th}$
times. In this case, the learning algorithm running time $T_{tbs}$
is random. In the following, we look at a deterministic time
sampler for estimating $\pi$. This module sets a sampling time
threshold $T_{th}$.

Time-Limited Sampling TLS($T_{th}$): Observe $z(t)$ for $T_{th}$
slots. Output $\hat{\pi}_k = \sum_{t=1}^{T_{th}} 1_{[z(t)=z_k]} / T_{th}$ for all $k$. ◇
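The two sampling modules can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: a single scalar reward is recorded per pair $(z, b)$ instead of one per task type $n$, every state is assumed to occur with positive probability (otherwise the TBS loop would not terminate), and the callables `sample_state` and `draw_reward` are hypothetical stand-ins for the environment.

```python
import random

def tbs(states, actions, sample_state, draw_reward, s_th, rng):
    """Threshold-based sampling: always try the least-sampled action."""
    counts = {(z, b): 0 for z in states for b in actions[z]}  # s(z, b, t)
    sums = dict.fromkeys(counts, 0.0)
    t = 0
    while min(counts.values()) < s_th:        # stop once s_min(t) >= s_th
        z = sample_state(rng)
        b = min(actions[z], key=lambda a: counts[(z, a)])
        sums[(z, b)] += draw_reward(z, b, rng)
        counts[(z, b)] += 1
        t += 1
    r_hat = {pair: sums[pair] / counts[pair] for pair in counts}
    return r_hat, t                           # empirical means and T_tbs

def tls(states, sample_state, T_th, rng):
    """Time-limited sampling: empirical state frequencies over T_th slots."""
    hits = {z: 0 for z in states}
    for _ in range(T_th):
        hits[sample_state(rng)] += 1
    return {z: hits[z] / T_th for z in states}
```

TBS returns a random stopping time because it waits for the rarest state to appear often enough, while TLS always runs for exactly `T_th` slots.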

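The following lemma shows the performance of the two
modules.

Lemma 2: The two learning modules satisfy:

(a) TBS($s_{th}$): $E\{T_{tbs}\} = O(s_{th})$, and with probability $1 - O\big(\exp\big(-\frac{s_{th}\log(s_{th})^2}{2(s_{th} r_{\max}^2 + r_{\max}\sqrt{s_{th}}\log(s_{th})/3)}\big)\big)$, $\delta_r = \log(s_{th})/\sqrt{s_{th}}$.

(b) TLS($T_{th}$): With probability $1 - O\big(\exp\big(-\frac{T_{th}\log(T_{th})^2}{2(T_{th} + \sqrt{T_{th}}\log(T_{th})/3)}\big)\big)$, $\delta_z = \log(T_{th})/\sqrt{T_{th}}$.

Proof: See Appendix B.

By choosing $s_{th} = T_{th} = V^c$, we can guarantee $\delta_r = \delta_z = c\log(V)V^{-c/2}$, with $P_{\delta_r}$ and $P_{\delta_z}$ being $1 - O(V^{-\log(V)})$.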
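The choice $s_{th} = T_{th} = V^c$ can be checked numerically. The sketch below assumes the error expression $\log(s)/\sqrt{s}$ from Lemma 2 and simply confirms that substituting $s = V^c$ gives $c\log(V)V^{-c/2}$; the function names are hypothetical.

```python
import math

def estimation_error(samples: float) -> float:
    """Error level log(s) / sqrt(s) after s samples (form taken from Lemma 2)."""
    return math.log(samples) / math.sqrt(samples)

def error_for_V(V: float, c: float) -> float:
    """Closed form c * log(V) * V^(-c/2) obtained with s = V^c."""
    return c * math.log(V) * V ** (-c / 2.0)

# The two expressions agree for any V > 1 and c > 0.
for V in (10.0, 100.0, 1000.0):
    for c in (0.5, 1.0, 2.0):
        assert math.isclose(estimation_error(V ** c), error_for_V(V, c))
```

A larger $c$ buys a smaller estimation error at the cost of a learning phase of length $V^c$, which is the tradeoff the convergence results later quantify.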


V. Performance Analysis 

In this section, we present the performance results of LRAM
and DRAM. We start with some definitions and assumptions. For
notational simplicity, we denote $\alpha = (\alpha^d, \alpha^q, \alpha^h)$ and write
$\alpha - \theta = (\alpha^d, \alpha^q - \theta_1, \alpha^h - \theta_2)$. Also, to indicate the different
distributions and reward functions used, we use $\hat{g}(\alpha)$ to denote
the dual function when $r$ is replaced by $\hat{r}$ in (30) and the
distribution is given by $\hat{\pi}$. Then, we use $g^{\hat{r}}(\alpha)$ and $g^{\hat{\pi}}(\alpha)$ to
denote the dual function with distribution $\pi$ and $\hat{r}$, and with
$\hat{\pi}$ and $r$, respectively.


A. Preliminaries 


We define the following polyhedral system structure:

Definition 2: A system is polyhedral with parameter $\rho > 0$
if the dual function $g(\alpha)$ satisfies:

$g(\alpha^*) \leq g(\alpha) - \rho\|\alpha^* - \alpha\|$. (37)

Here $\alpha^*$ is an optimal solution of (29). ◇

The polyhedral structure typically appears in practical systems,
especially when the system action sets are discrete (see [20]
for more discussions). Note that (37) holds for all $V$ if it holds
under any $V$, in particular $V = 1$.
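Condition (37) can be illustrated on a scalar toy function. The sketch below is an assumption-laden example and not part of the paper: it checks (37) on a finite grid for a hypothetical piecewise-linear $g$, and shows that a smooth quadratic fails the check for the same $\rho$ because it is flat near its minimizer.

```python
def is_polyhedral_on_grid(g, alpha_star, rho, grid, tol=1e-9):
    """Check g(alpha*) <= g(alpha) - rho * |alpha* - alpha| on every grid point."""
    g_star = g(alpha_star)
    return all(g_star <= g(a) - rho * abs(a - alpha_star) + tol for a in grid)

def g_piecewise(a):
    """Hypothetical dual function: max of two lines, minimized at alpha* = 5/3."""
    return max(2.0 - a, 0.5 * a - 0.5)

def g_smooth(a):
    """Quadratic with minimizer 1; no rho > 0 satisfies (37) near the minimum."""
    return (a - 1.0) ** 2

grid = [i / 100.0 for i in range(0, 501)]
ok_piecewise = is_polyhedral_on_grid(g_piecewise, 5.0 / 3.0, 0.5, grid)
ok_smooth = is_polyhedral_on_grid(g_smooth, 1.0, 0.5, grid)
```

For the piecewise-linear function the largest admissible $\rho$ is the smaller of the two slope magnitudes (0.5 here), which matches the intuition that discrete action sets give a dual function with a sharp minimum.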

In our analysis, we make the following assumptions. 

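Assumption 1: There exist constants $\epsilon_r, \epsilon_z = \Theta(1) > 0$
such that for any valid state distribution $\hat{\pi}$ and reward statistics
$\hat{r}$ with $\|\hat{\pi} - \pi\| \leq \epsilon_z$ and $\|\hat{r} - r\| \leq \epsilon_r$, there exist a set of
actions $\{\gamma_i^k\}$, $\{b_i^k\}$, $\{R_i^k\}$, and $\{h_i^k\}$, with $i = 1, 2, \ldots, \infty$ and $k = 1, \ldots, |Z|$
(possibly depending on $\hat{\pi}$ and $\hat{r}$), and variables $\lambda_i^k \geq 0$ with $\sum_i \lambda_i^k = 1$ for all
$k$, such that:

$\sum_k \hat{\pi}_k \sum_i \lambda_i^k [\gamma_{in}^k - \hat{r}_n(z_k, \mu_n(z_k, b_i^k))] \leq -\eta_0$, (38)

where $\eta_0 = \Theta(1) > 0$ is independent of $\hat{\pi}$ and $\hat{r}$, and that

$\sum_k \hat{\pi}_k \sum_i \lambda_i^k R_{in}^k = \sum_k \hat{\pi}_k \sum_i \lambda_i^k \mu_n(z_k, b_i^k)$ (39)

$\sum_k \hat{\pi}_k \sum_i \lambda_i^k \sum_n b_{imn}^k = \sum_k \hat{\pi}_k \sum_i \lambda_i^k h_{im}^k$ (40)

where $0 < \sum_k \hat{\pi}_k \sum_i \lambda_i^k R_{in}^k < E_{\hat{\pi}}\{A_n(t)\}$ as well as
$0 < \sum_k \hat{\pi}_k \sum_i \lambda_i^k h_{im}^k < E_{\hat{\pi}}\{e_m(t)\}$. ◇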

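Assumption 2: For any $\hat{\pi}$ and $\hat{r}$ with $\|\hat{\pi} - \pi\| \leq \epsilon_z$ and
$\|\hat{r} - r\| \leq \epsilon_r$, if $g(\alpha)$ is polyhedral with parameter $\rho$, then
$g^{\hat{\pi}}(\alpha)$ is also polyhedral with parameter $\rho$. ◇

Assumption 3: For any $\hat{\pi}$ and $\hat{r}$ with $\|\hat{\pi} - \pi\| \leq \epsilon_z$ and
$\|\hat{r} - r\| \leq \epsilon_r$, $g^{\hat{\pi}}(\alpha)$ has a unique optimal solution over
$\mathbb{R}^{M+2N}$. ◇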

In the network optimization literature, e.g., DU, ED, 
Assumption 1 is commonly assumed with $\epsilon_r = \epsilon_z = 0$.
By allowing $\epsilon_r, \epsilon_z > 0$, we assume that systems with similar
statistics have similar stability regions. Assumption 2
assumes that systems with similar statistics share a similar
dual structure. This is not restrictive. In fact, when action
sets are discrete, it is often the case that $g_k(\alpha)$ is polyhedral,
which usually leads to a polyhedral structure of $g^{\hat{\pi}}(\alpha)$. The
uniqueness assumption is also often guaranteed by the utility 
maximization structure, e.g., E2- 


B. Queue and Utility Performance 

We first summarize the performance of LRAM.

Theorem 1: Suppose $T_{\delta_r} < \infty$ with probability 1. Under
LRAM, we have for all $t \geq T_{\delta_r} + 1$ that:

$d_n(t) \leq d_{\max} \triangleq V\beta + r_{\max}, \ \forall n$ (41)

$Q_n(t) \leq Q_{\max} \triangleq \theta_1 + r_{\max}, \ \forall n$ (42)

$H_m(t) \leq H_{\max} \triangleq \theta_2 + h_{\max}, \ \forall m.$ (43)

Moreover, we have with probability $P_{\delta_r}$ that,

$f_{av}^{LRAM} \geq f_{av}^* - \frac{G + 2N r_{\max}\delta_r}{V} - 2N\beta\delta_r$. ◇ (44)

Proof: See Appendix C. ■

The last term in (44) involves the estimation error $\delta_r$. This
can be viewed as the performance loss due to inaccurate
reward information. We remark here that the deterministic
queueing bounds are important for both algorithm implementation
and performance guarantee. This is so because errors
in reward function estimation will be amplified by the queue
sizes when used in decision making, i.e., (16).

We now present the performance results for DRAM.

Theorem 2: Suppose $\max(T_{\delta_r}, T_{\delta_z}) < \infty$ with probability
1. Suppose $g(\alpha)$ is polyhedral with $\rho = \Theta(1) > 0$, and that
$\delta_z \leq \epsilon_z$ and $\delta_r \leq \epsilon_r$, and $\alpha^* + \theta \succ 0$. Then, with a sufficiently
large $V$, we have with probability $P_{\delta_z}P_{\delta_r}$ that, under DRAM,

$f_{av}^{DRAM} \geq f_{av}^* - O(1/V + \delta_r)$. (45)

Also, the fraction of time dropping happens is $O(1/V^2)$.
Moreover, for all queues, there exist $\Theta(1)$ constants $D$, $K$, $a$
such that:

$\Pr\{d_n(t) \geq \frac{3}{2}\zeta + D + \nu\} \leq a e^{-K\nu}$ (46)

$\Pr\{Q_n(t) \geq \frac{3}{2}\zeta + D + \nu\} \leq a e^{-K\nu}$ (47)

$\Pr\{H_m(t) \geq \frac{3}{2}\zeta + D + \nu\} \leq a e^{-K\nu}$. (48)

Thus, all queues are stable. ◇

Proof: See Appendix D. ■


Note that if we have $\delta_r = 0$ with $T_{\delta_r} = 0$ and $P_{\delta_r} = 1$,
then Theorem 1 recovers the known $[O(1/V), O(V)]$ utility-delay
tradeoff for stochastic network problems [12]. On the
other hand, if we also have $\delta_z = 0$ with $T_{\delta_z} = 0$ and $P_{\delta_z} = 1$,
then DRAM provides a new way for achieving the near-optimal
$[O(1/V), O(\log(V)^2)]$ utility-delay tradeoff.

In both LRAM and DRAM, it is possible to continuously update 
the reward function estimations during the control steps. 
However, this does not automatically guarantee that we can 
always eliminate the effect of $\delta_r$. This is so because the initial
estimation $\hat{r}$ may affect what options will be continuously
updated later. On the other hand, the same performance results
will hold if further updates do not increase $\delta_r$.

C. Convergence time 

Here we look at another important performance metric - 
algorithm convergence time, which characterizes the time it 
takes for the algorithm to enter the “steady state.” Faster 
convergence implies better robustness against system statistics 
changes and higher efficiency in resource allocation, and is 
particularly important when system statistics can change. The 
formal definition of convergence time is as follows ins. 

Definition 3: For a given constant $D$, the $D$-convergence
time of a control algorithm $\Pi$, denoted by $T_D^{\Pi}$, is the
time it takes for the queue vector $(d(t), Q(t), H(t))$
($(\hat{d}(t), \hat{Q}(t), \hat{H}(t))$ under DRAM) to get to within $D$ distance
of $\alpha^* + \theta$, i.e.,

$T_D^{\Pi} \triangleq \inf\{t \mid \|(d(t), Q(t), H(t)) - (\alpha^* + \theta)\| \leq D\}$. ◇ (49)
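Definition 3 translates directly into a first-hitting-time computation on a recorded queue trajectory. The sketch below is illustrative only; `convergence_time` is a hypothetical helper, the trajectory is made up, and the Euclidean norm is an assumption since the norm is not specified here.

```python
import math

def convergence_time(trajectory, target, D):
    """First slot t at which the queue vector is within distance D of the target.

    trajectory: sequence of queue vectors (d(t), Q(t), H(t)) flattened to tuples
    target:     the point alpha* + theta, as a tuple of the same length
    Returns None if the trajectory never enters the D-ball.
    """
    for t, q in enumerate(trajectory):
        if math.dist(q, target) <= D:
            return t
    return None

# Made-up two-dimensional trajectory approaching the target (10, 12).
path = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
t_hit = convergence_time(path, (10.0, 12.0), D=5.0)
```

Note that the definition records the first entry into the ball, not the time after which the vector stays inside it.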



This definition of convergence time concerns when
an algorithm enters its “optimal state.” It is different from
the metrics considered in [23] and [24], which concern
the time it takes for the objective value and constraints to
be within a certain accuracy. With Definition 3, we have the
following results:

Theorem 3: Suppose $g(\alpha)$ is polyhedral with $\rho = \Theta(1) > 0$,
$\delta_z \leq \epsilon_z$ and $\delta_r \leq \epsilon_r$, and $\alpha^* + \theta \succ 0$. Then, with a
sufficiently large $V$, we have:

$E\{T_{D_1}^{LRAM}\} = O(T_{\delta_r} + \Theta(V))$ w.p. $P_{\delta_r}$ (50)

$E\{T_{D_2}^{DRAM}\} = O(T_l + \Theta(\delta_z V))$ w.p. $P_{\delta_r}P_{\delta_z}$. (51)

Here $T_l = \max(T_{\delta_r}, T_{\delta_z})$ denotes the total learning time in
DRAM, $D_1 = \Theta(\delta_r V) + \Theta(1)$, and $D_2 = \Theta(\delta_r V) + \Theta(1)$. ◇

Proof: See Appendix E. ■

Combining Theorems 1, 2 and 3, we see that $\delta_r$ largely
affects the overall utility performance (reflected by $D_1$ and
$D_2$), while a small $\delta_z$ can greatly improve the convergence time and
delay. This indicates that information of different system components
can play very different roles in algorithm performance,
and learning accuracies should be carefully chosen for meeting
a desired performance goal.


D. Necessity in Controlling S r 

Here we provide a simple example showing that it is
necessary to control $\delta_r$ for good utility performance. Hence,
it is important to learn the reward value for each matching
option. Consider the case when $N = 2$ and $M = 1$.
Suppose $e(t) = A_1(t) = A_2(t) = 1$ for all $t$. Also suppose
$b(t) \in \{(0,1), (1,0), (0,0)\}$, that is, at every time $t$, we can
only allocate resource to one queue or to none. Suppose $c(t) = 0$,
$\mu_n(t) = b_n(t)$, and $r_n(t) = \mu_n(t)$. Finally, assume that
$U_1(r_1) = \log(1 + r_1)$ and $U_2(r_2) = \log(1 + 2r_2)$.

In this case, the true optimal takes place at $\bar{r}_1 = 1/4$ and
$\bar{r}_2 = 3/4$ with $U_{total} = 0.7828$. Now suppose we incorrectly
estimate the rewards to be $\hat{r}_n(t) = (1 + \delta_n)\mu_n(t)$. Then, one
can show that the optimal rewards become:

$\bar{r}_1 = \frac{2(1+\delta_1)(1+\delta_2) - 2(1+\delta_2) + (1+\delta_1)}{4(1+\delta_1)(1+\delta_2)}$ (52)

$\bar{r}_2 = \frac{2(1+\delta_1)(1+\delta_2) + 2(1+\delta_2) - (1+\delta_1)}{4(1+\delta_1)(1+\delta_2)}$ (53)

which is roughly $\bar{r}_1 \approx (1 + 2\delta_1 - \delta_2)/4$ and $\bar{r}_2 \approx (3 - 2\delta_1 + \delta_2)/4$. Thus,
the resulting optimal utility is given by:

12^1 +|<J 2 


(/total « 0.7828 - 


™±=h.. Thus, 


(54) 


Therefore, in order to obtain an $O(1/V)$ close-to-optimal
utility, it is necessary to ensure that $\max(|\delta_1|, |\delta_2|) = O(1/V)$.
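The closed forms (52) and (53) can be cross-checked numerically. The sketch below assumes the setting of the example (one unit of resource per slot split between the two queues, perceived utility $\log(1 + (1+\delta_1)\mu_1) + \log(1 + 2(1+\delta_2)\mu_2)$ with $\mu_1 + \mu_2 = 1$) and compares the closed form against a brute-force grid search; the function names are hypothetical.

```python
import math

def closed_form(d1: float, d2: float):
    """Allocation shares from (52)-(53) under estimation errors d1, d2."""
    a, b = 1.0 + d1, 1.0 + d2
    r1 = (2 * a * b - 2 * b + a) / (4 * a * b)
    r2 = (2 * a * b + 2 * b - a) / (4 * a * b)
    return r1, r2

def grid_argmax(d1: float, d2: float, steps: int = 100000):
    """Brute-force maximizer of the perceived utility over mu_1 in [0, 1]."""
    a, b = 1.0 + d1, 1.0 + d2

    def perceived(i: int) -> float:
        mu1 = i / steps
        return math.log(1 + a * mu1) + math.log(1 + 2 * b * (1 - mu1))

    best = max(range(steps + 1), key=perceived)
    return best / steps, 1 - best / steps

exact = closed_form(0.1, -0.05)
approx = grid_argmax(0.1, -0.05)
```

With $\delta_1 = \delta_2 = 0$ the closed form returns $(1/4, 3/4)$, and a positive $\delta_1$ shifts allocation toward the first queue, which is the distortion the example is meant to exhibit.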


VI. Simulation 

In this section, we present simulation results for our algorithms.
We consider a system that has $N = 2$ and $M = 1$. We
assume that $e(t)$ is 0 or 2 with equal probabilities. $A_1(t)$ is 0
or 2 with equal probabilities and $A_2(t)$ is 1 or 2 with equal
probabilities. $b(t) \in \{(0,1), (1,0), (0,0)\}$, i.e., at every time
$t$, we allocate one unit of resource to one queue or to none. We set
$c(t) = b_1(t) + b_2(t)$ and $\mu_n(t) = b_n(t)$, and $\gamma_n(t) \in \{0, 1, 2\}$.

There are two system states $\omega(t) \in \{1, 2\}$. In each state, the
reward functions are given by $r_n(t) = w_n(\omega(t))\mu_n(t)$, where
$w_1(1) = 0.8$ and $w_2(1) = 1$ and $w_1(2) = 1$ and $w_2(2) = 0.8$.
Thus, the system state indicates which tasks are preferred
under the specific condition. Every time the corresponding
reward is either $0.5r_n(t)$ or $1.5r_n(t)$, with equal probabilities.
Finally, we assume that $U_1(r_1) = 1.2\log(1 + 2r_1)$ and
$U_2(r_2) = 1.2\log(1 + 4r_2)$.

From the definitions, we have that $\beta = 5$, $\beta^\mu_l = \beta^\mu_u = 1$ and
$\beta^r_u = \max_{\omega(t), n} w_n(\omega(t))$. We also set $r_{\max} = 2$, $\mu_{\max} = 1$


and 6„ 


= 1 , K 


= 2. We simulate both LRAM and DRAM 


with $V \in \{10, 20, 50, 80, 100\}$. According to (20) and (21),
$\theta_1 = (5V + 2)\beta^r_u + 4$ and $\theta_2 = (5V + 2)\beta^r_u + 3$. In DRAM, we
set $\zeta = \log(V)^2$. We use TBS($s_{th}$) to estimate $r$ and set $s_{th} = \log(V)^2$,
and use TLS($T_{th}$) to estimate $\pi$ and set $T_{th} = T_{tbs}$.
In LRAM, we randomly add or subtract the estimation error $\delta_r$
from the true values.
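The random inputs of this setup can be generated with a short sampler. The sketch below is not the authors' simulator: it only draws the per-slot arrivals, state, and noisy rewards described above, and it assumes the two system states are equally likely, which the text does not state. The names `sample_slot` and `realized_reward` are hypothetical.

```python
import random

# State-dependent reward weights w_n(omega), indexed as W[omega][n - 1].
W = {1: (0.8, 1.0), 2: (1.0, 0.8)}

def sample_slot(rng: random.Random) -> dict:
    """Draw one slot of resource arrival, task arrivals, and system state."""
    return {
        "e": rng.choice([0, 2]),       # resource arrival e(t)
        "A1": rng.choice([0, 2]),      # task arrival A_1(t)
        "A2": rng.choice([1, 2]),      # task arrival A_2(t)
        "omega": rng.choice([1, 2]),   # system state (equal odds assumed)
    }

def realized_reward(n: int, omega: int, mu_n: float, rng: random.Random) -> float:
    """Noisy reward: 0.5x or 1.5x the mean w_n(omega) * mu_n, equally likely."""
    mean = W[omega][n - 1] * mu_n
    return rng.choice([0.5, 1.5]) * mean

rng = random.Random(1)
slot = sample_slot(rng)
samples = [realized_reward(1, 1, 1.0, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)   # should be close to 0.8
```

Averaging many noisy rewards for a fixed pair recovers the mean reward, which is exactly the quantity the TBS module estimates.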




Fig. 3. Utility performance and task queue size. 


Fig. 3 first shows the utility performance and the task
queue behavior of LRAM and DRAM, where the number after
LRAM denotes $\delta_r$. We see from the left plot that except for
$\delta_r = 0.1$, LRAM performs very well under all other error values,
suggesting that estimation error indeed plays an important
role in system utility. We also see that DRAM performs very
well starting from $V \geq 50$. The right plot shows the backlog
(delay) performance under different schemes. It is evident that
DRAM achieves an $O(\log(V)^2)$ delay in this case, while all
other variants possess an $O(V)$ delay. This demonstrates the
importance of incorporating system dynamics information into
algorithm design.

Fig. 4 then shows the resource queue $H(t)$ and deficit
queues $d(t)$. We see that DRAM ensures an $O(\log(V)^2)$ average
resource queue, while other algorithms result in an $O(V)$
queue size. This implies that DRAM ensures a very short
stay in the system for the resource items! This feature is
particularly useful if the resource items are human users, e.g.,
in crowdsourcing.

Fig. 4. Resource queue and deficit queue sizes.

Finally, Fig. 5 shows the convergence behavior of the algorithms
for $V = 100$. Here we show the resource queue value
as its convergence time dominates the others. We see that RAM
takes an $O(V)$ time to converge, which is expected. We also
observe that $H(t)$ under LRAM-0.05 and LRAM-0.1 converges to
values slightly above those under RAM. This explains why their
performance is slightly worse. On the other hand, we also see
that DRAM converges quickly. Its steady-state value
is slightly above that under RAM because of the inaccuracy of
$\hat{r}$. Even in this case, we see that there is a 2.5x convergence
speedup (most of the learning time is due to sampling) and
DRAM achieves very good performance. In the case when $r$
can be obtained from other data sources beforehand, which can
commonly be done in practice, e.g., in online advertising, we
see that DRAM (called state-only DRAM in this case) achieves a
10x convergence speedup (50 slots vs. 500 slots)!



Fig. 5. Convergence of LRAM and DRAM for V = 100. 

We observe in all simulation instances that no dropping
occurs. This demonstrates the effectiveness of the algorithms
and validates Theorem 2.

VII. Conclusion 

In this paper, we study the problem of optimal matching
with queues in dynamic systems. We develop two online
learning-aided algorithms, LRAM and DRAM, for resolving the
challenging underflow problem and achieving near-optimality.
We show that LRAM achieves an $O(\epsilon + \delta_r)$ close-to-optimal system utility,
for any $\epsilon > 0$, while ensuring an $O(1/\epsilon)$ delay bound and
an $O(1/\epsilon)$ algorithm convergence time. DRAM, on the other
hand, guarantees a similar $O(\epsilon + \delta_r)$ close-to-optimal system utility, while
achieving an $O(\delta_z/\epsilon + \log(1/\epsilon)^2)$ delay bound and an $O(\delta_z/\epsilon)$
algorithm convergence time, which can be significantly better
compared to LRAM when $\delta_z$ is small. Our algorithms and results
reveal the interesting fact that different system information
can play very different roles in algorithm performance and
provide insights into joint learning-control algorithm design
for dynamic systems.
Appendix A - Proof of Lemma 1

We prove Lemma 1 here.

Proof: (Lemma 1) Using the queueing dynamics, including (6), we have:

$(Q_n(t+1) - \theta_1)^2 \leq (Q_n(t) - \theta_1)^2 - 2(Q_n(t) - \theta_1)(\mu_n(t) - R_n(t)) + R_n(t)^2 + \mu_n(t)^2.$

Similarly, we get:

$(H_m(t+1) - \theta_2)^2 \leq (H_m(t) - \theta_2)^2 - 2(H_m(t) - \theta_2)\big(\sum_n b_{mn}(t) - h_m(t)\big) + \big(\sum_n b_{mn}(t)\big)^2 + h_m(t)^2,$

and that

$d_n(t+1)^2 \leq d_n(t)^2 - 2d_n(t)(\kappa_n(t) - \gamma_n(t)) + \kappa_n(t)^2 + \gamma_n(t)^2.$

Summing the above and using the definition of $L(t)$ and $\Delta_V(t)$, we get:

$L(t+1) - L(t) - Vf(t) \leq G - V\big(\sum_n U_n(\gamma_n(t)) - c(t)\big) - \sum_n (Q_n(t) - \theta_1)(\mu_n(t) - R_n(t)) - \sum_n d_n(t)(\kappa_n(t) - \gamma_n(t)) - \sum_m (H_m(t) - \theta_2)\big(\sum_n b_{mn}(t) - h_m(t)\big).$

Here $G \triangleq N(A_{\max}^2 + \mu_{\max}^2 + 2r_{\max}^2) + Mh_{\max}^2 + MN^2b_{\max}^2$.

Rearranging terms, we obtain:

$L(t+1) - L(t) - Vf(t) \leq G - \sum_n [VU_n(\gamma_n(t)) - d_n(t)\gamma_n(t)] + \sum_n (Q_n(t) - \theta_1)R_n(t) + \sum_m (H_m(t) - \theta_2)h_m(t) + Vc(t) - \sum_m (H_m(t) - \theta_2)\sum_n b_{mn}(t) - \sum_n (Q_n(t) - \theta_1)\mu_n(t) - \sum_n d_n(t)\kappa_n(t).$

Taking expectations on both sides conditioning on $y(t)$
and using the fact that $\kappa_n(t)$ is an i.i.d. random variable given
$z(t)$, $b(t)$, $Q(t)$, we see that the lemma follows. ■

Appendix B - Proof of Lemma 2

We prove Lemma 2 here. In our proof, we will use the
following theorem from [25].

Theorem 4: [25] Suppose $X_i$ are independent random variables
satisfying $X_i \leq B$ for $1 \leq i \leq n$. Let $X = \sum_i X_i$ and
$\|X\| = \sqrt{\sum_i E\{X_i^2\}}$. Then we have:

$\Pr\{X \leq E\{X\} - b\} \leq e^{-\frac{b^2}{2(\|X\|^2 + Bb/3)}}$. ◇ (55)

Proof: (Lemma 2) First of all, we show that $E\{T_{tbs}\} = O(s_{th})$.
To see this, notice that since each $B_k$ is finite,
if the state $z(t) = z_k$ appears $|B_k|s_{th}$ times, we must have
sampled every $b \in B_k$ at least $s_{th}$ times. Thus,

$T_{tbs} \leq \sum_k T\{\text{visit } z_k \ |B_k|s_{th} \text{ times}\}$. (56)

Taking the expectations, we see that $E\{T_{tbs}\} \leq s_{th}\sum_k |B_k|/\pi_k$.
Then, we see that in $\hat{r}$, each single value has been sampled at least
$s_{th}$ times. Using Theorem 4 and the fact that $\kappa_n(t) \leq r_{\max}$,
we get:

$\Pr\big\{\sum_{t=1}^{T_{tbs}} 1_{[z(t)=z,\, b(t)=b]}\kappa_n(t) \leq r_n(z,b)s(z,b,T_{tbs}) - b\big\} \leq e^{-\frac{b^2}{2(s(z,b,T_{tbs})r_{\max}^2 + r_{\max}b/3)}}.$

Choosing $b = \sqrt{s_{th}}\log(s_{th})$, dividing both sides of the inequality
inside by $s(z,b,T_{tbs})$, and using $s(z,b,T_{tbs}) \geq s_{th}$,
we get:

$\Pr\big\{\hat{r}_n(z,b) \leq r_n(z,b) - \frac{\log(s_{th})}{\sqrt{s_{th}}}\big\} \leq e^{-\frac{s_{th}\log(s_{th})^2}{2(s_{th}r_{\max}^2 + r_{\max}\sqrt{s_{th}}\log(s_{th})/3)}}.$

Using Theorem 4 with $-X$, we get a similar bound for the
other side. Hence,

$\Pr\big\{|\hat{r}_n(z,b) - r_n(z,b)| \geq \frac{\log(s_{th})}{\sqrt{s_{th}}}\big\} \leq 2e^{-\frac{s_{th}\log(s_{th})^2}{2(s_{th}r_{\max}^2 + r_{\max}\sqrt{s_{th}}\log(s_{th})/3)}}.$

Using the union bound, we see that Part (a) follows. Part (b)
can be proven similarly. ■

Appendix C - Proof of Theorem 1

We present the proof for Theorem 1 here. For our analysis,
we will use the following result, which is Theorem 1 in [26].

Theorem 5: [26] Let $\alpha^*$ be an optimal solution of (29).
Then, $g(\alpha^*) \geq Vf_{av}^*$. ◇

Proof: (Theorem 1) (Queueing) First consider $d_n(t)$. We
see that if $d_n(t) \leq V\beta$, then $d_n(t+1) \leq V\beta + r_{\max}$. On
the other hand, from (15), whenever $d_n(t) > V\beta$, $\gamma_n(t) = 0$.
Hence, $d_n(t)$ will not further increase. This proves the bound
for $d_n(t)$.

The bounds for $Q_n(t)$ and $H_m(t)$ can be similarly proven by
noticing that LRAM will not further admit tasks once $Q_n(t) >
\theta_1$ and it does not admit resources once $H_m(t) > \theta_2$.

(Utility) We carry out the proof by comparing the RHS of
(14) under LRAM with any other control policy, including the
ones that do not respect the no-underflow constraint.
To this end, look at (16). We want to show that even without
the no-underflow constraint, LRAM ensures that (i) whenever $H_m(t) < Nb_{\max}$,
$b_{mn}(t) = 0$ for all $n$, and (ii) whenever $Q_n(t) < \mu_{\max}$,
$\mu_n(t) = 0$. We first show (i). Suppose $H_m(t) < Nb_{\max}$.
We see from (21) that:

$H_m(t) - \theta_2 < -(V\beta + r_{\max})\beta^r_u - r_{\max}\beta^\mu_u$. (57)

In this case, denote the optimal resource allocation vector as
$b^*$ and suppose there is one $n$ with $b^*_{mn} > 0$. Let $\tilde{b}$ be the
vector obtained from $b^*$ by setting $b^*_{mn} = 0$. We have:

$\Psi^{\hat{r}}(b^*) - \Psi^{\hat{r}}(\tilde{b})$ (58)
$= Vc(z, b^*) - Vc(z, \tilde{b}) - (H_m(t) - \theta_2)b^*_{mn}$
$\quad - (Q_n(t) - \theta_1)(\mu_n(z(t), b^*) - \mu_n(z(t), \tilde{b}))$
$\quad - d_n(t)(\hat{r}_n(z(t), \mu_n(z(t), b^*)) - \hat{r}_n(z(t), \mu_n(z(t), \tilde{b})))$
$> [(V\beta + r_{\max})\beta^r_u + r_{\max}\beta^\mu_u]b^*_{mn} - r_{\max}\beta^\mu_u b^*_{mn} - (V\beta + r_{\max})\beta^r_u b^*_{mn} = 0.$

In the inequality, we have used the fact that $Q_n(t) - \theta_1 \leq r_{\max}$,
$\mu_n(z(t), b^*) - \mu_n(z(t), \tilde{b}) \leq \beta^\mu_u b^*_{mn}$, $d_n(t) \leq V\beta + r_{\max}$,
and $\hat{r}_n(z(t), \mu_n(z(t), b^*)) - \hat{r}_n(z(t), \mu_n(z(t), \tilde{b})) \leq \beta^r_u b^*_{mn}$.
However, (58) contradicts the fact that $b^*$ is the minimizer
of $\Psi^{\hat{r}}(b)$ and shows that we must have $b^*_{mn} = 0$ for all $n$ whenever
$H_m(t) < Nb_{\max}$.

Now we look at the case of $Q(t)$. Suppose $Q_n(t) < \mu_{\max}$.
Then, we have:

$Q_n(t) - \theta_1 < -(h_{\max} + (V\beta + r_{\max})\beta^r_u)/\beta^\mu_l$. (59)

We similarly let $b^*$ be the optimal solution. Then, we construct
$\tilde{b}$ by setting $b^*_{m^*n} > 0$ to zero, where $m^* = \arg\min_m b^*_{mn}$.
In this case, if $\mu_n(z(t), b^*) = 0$, we are done. Otherwise,

$\Psi^{\hat{r}}(b^*) - \Psi^{\hat{r}}(\tilde{b})$ (60)
$> -h_{\max}b^*_{m^*n} - (V\beta + r_{\max})\beta^r_u b^*_{m^*n} + ((h_{\max} + (V\beta + r_{\max})\beta^r_u)/\beta^\mu_l)\beta^\mu_l b^*_{m^*n} = 0.$

The inequality follows since $H_m(t) - \theta_2 \leq h_{\max}$, $d_n(t) \leq
V\beta + r_{\max}$, and that $\mu_n(z(t), b^*) > 0$, which implies
$\mu_n(z(t), b^*) \geq \beta^\mu_l b^*_{m^*n}$. This contradicts the fact that
$b^*$ is the minimizer and shows that whenever $Q_n(t) < \mu_{\max}$,
$\mu_n(z(t), b^*) = 0$.

These two properties imply that LRAM automatically guarantees
that the no-underflow constraints are satisfied for all
$H_m(t)$ and that the number of tasks actually served always satisfies
$\min[Q_n(t), \mu_n(t)] = \mu_n(t)$. To compare our control policy with any other matching
policies for the drift (14), we still need to show that the actions
under LRAM, which is based on the estimated reward matrix $\hat{r}$,
ensure that the RHS of (14), defined with the true reward $r$,
is approximately minimized.


To do so, first observe that $\gamma(t)$, $R(t)$ and $h(t)$ are
optimally chosen given $y(t)$ and $z(t)$. Hence, the only approximation
comes from choosing $b(t)$. Let $b^*_{\hat{r}}(t)$ be the chosen
vector under LRAM and let $b^*_r(t)$ be the vector chosen if $r$ is
used. We have:

$\Psi^{\hat{r}}(b^*_{\hat{r}}(t)) \leq \Psi^{\hat{r}}(b^*_r(t))$ (61)
$= \Psi^{r}(b^*_r(t)) + \sum_n d_n(t)[r_n(b^*_r(t)) - \hat{r}_n(b^*_r(t))].$

On the other hand, we also have:

$\Psi^{r}(b^*_{\hat{r}}(t)) = \Psi^{\hat{r}}(b^*_{\hat{r}}(t)) + \sum_n d_n(t)[\hat{r}_n(b^*_{\hat{r}}(t)) - r_n(b^*_{\hat{r}}(t))].$

Combining the above two relations and using the fact that
the reward learning module is a $(T_{\delta_r}, P_{\delta_r}, \delta_r)$-learning module, we see that with
probability $P_{\delta_r}$,

$\Psi^{r}(b^*_{\hat{r}}(t)) \leq \Psi^{r}(b^*_r(t)) + 2\sum_n d_n(t)\delta_r \leq \Psi^{r}(b^*_r(t)) + 2N(V\beta + r_{\max})\delta_r$. (62)

This shows that the RHS of (14) under LRAM is minimized to
within $2N(V\beta + r_{\max})\delta_r$, over any other policies. Comparing
this to $g_k(\alpha^d, \alpha^q, \alpha^h)$ in (30) and using the definition of
$g(\alpha^d, \alpha^q, \alpha^h)$, we conclude that:

$\Delta_V(t) = E\{L(t+1) - L(t) - Vf(t) \mid y(t)\}$
$\leq G + 2N(V\beta + r_{\max})\delta_r - g(d(t), Q(t), H(t))$
$\overset{(a)}{\leq} G + 2N(V\beta + r_{\max})\delta_r - Vf_{av}^*$. (63)

Here (a) follows from Theorem 5. Taking an expectation over
$y(t)$ on both sides and carrying out a telescoping sum from
$t = 0, \ldots, T-1$, and dividing both sides by $VT$, we obtain:

$\frac{1}{T}\sum_{t=0}^{T-1} E\{f(t)\} = \frac{1}{T}\sum_{t=0}^{T-1}\sum_n E\{U_n(\gamma_n(t)) - c(t)\} \geq f_{av}^* - \frac{G + 2Nr_{\max}\delta_r}{V} - 2N\beta\delta_r.$

Taking a limit as $T \to \infty$, and using Jensen’s inequality and
the fact that $U_n(r_n)$ is concave, we get:

$\sum_n U_n(\bar{\gamma}_n) - \bar{c} \geq f_{av}^* - \frac{G + 2Nr_{\max}\delta_r}{V} - 2N\beta\delta_r$. (64)

Finally, using the fact that $d(t)$ is bounded, which implies
$\bar{\gamma}_n \leq \bar{r}_n$ for all $n$, and that $U_n(r)$ is increasing, we see that
the theorem follows. ■

Appendix D - Proof of Theorem 2

Here we prove the performance of DRAM. We carry out
the analysis in the following steps. First, we show that the
estimated optimal multiplier $\hat{\alpha}^* = (\hat{\alpha}^{d*}, \hat{\alpha}^{q*}, \hat{\alpha}^{h*})$ is close to
the true optimal. Then, we show via drift-augmentation that
the definitions of $\hat{Q}(t)$, $\hat{H}(t)$, and $\hat{d}(t)$ ensure a near-optimal
algorithm performance.

We now have the following lemma for the first step. In the
lemma, we denote by $\tilde{\alpha}^*$ the optimal solution for $g^{\hat{r}}(\alpha)$, which is
$g(\alpha)$ with $\pi$ and $\hat{r}$.

Lemma 3: Suppose $g(\alpha)$ is polyhedral with $\rho = \Theta(1) > 0$,
and that $\delta_z \leq \epsilon_z$ and $\delta_r \leq \epsilon_r$. Then, with probability $P_{\delta_z}P_{\delta_r}$,
we have:

$\|\hat{\alpha}^* - \tilde{\alpha}^*\| \leq \frac{2\delta_z V f_{\max}\vartheta}{\rho}$ (65)

$\|\tilde{\alpha}^* - \alpha^*\| \leq \frac{2V f_{\max}\delta_r}{\rho\eta}$ (66)

where $\vartheta \triangleq |Z|(1 + r_{\max} + A_{\max} + \mu_{\max} + Nb_{\max} + h_{\max})/\eta$
and $\eta = \Theta(1)$. ◇

Proof: See Appendix F. ■

In our analysis, we make use of the following two results.

Lemma 4: ED Let $Q(t)$ be the size of a single queue with
dynamics $Q(t+1) = [Q(t) - \mu(t)]^+ + A(t)$. Suppose $0 \leq
\mu(t), A(t) \leq \mu_{\max} = \Theta(1)$ for all $t$ and that the queue is
stable. Then,

$\overline{\mu(t)} - \overline{A(t)} \leq \mu_{\max}\overline{\Pr\{Q(t) < \mu_{\max}\}}$. (67)

Here $\overline{x(t)} = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} E\{x(t)\}$. ◇

Theorem 6: Under LRAM with reward functions $\hat{r}$, there
exist $\Theta(1)$ constants $a$, $K$, and $D$, such that,

$P(D, \nu) \leq ae^{-K\nu}$, (68)

where $P(D, \nu)$ is defined as:

$P(D, \nu) \triangleq \lim_{t\to\infty}\Pr\{\|(d(t), Q(t), H(t)) - (\alpha^* + \theta)\| > D + \nu\}$. ◇

Proof: Similar to the proof of Theorem 1 in [20]. Omitted
for brevity. ■

Proof: (Theorem [2]) To prove Theorem |2j we define the 
following drift-augmentation term: 

A a {t) = - Y^n - C)E{^(i) - Rn(f) \ V (t)} (69) 

n 

- £(«n* - C)E{r„(f) - 7 nit) I y(t)} 

n 

-£(«m ~ ()E{Y b mn(t) ~ h m {t) \ y(t)}. 

m n 

Adding it to both sides of ( [T4 ) i, we get: 

Ay(f) + A„(f) (70) 

< G - u£E{t/„( 7 „(f)) - d n {t)l n (t) | y(t)} 
+ ^2(Qn(t) - 6 >i)E{f?„(f) | y(t)}

n 

+ YXH m (t) - d 2 )E{h m (t) I y(t)} 

m 

+ E{Uc(f) - Y(Hm(t) - d 2 )Y, b mn(t) 

m n 

~Y^ n ^ ~ e i)^n{t) ~Y^n( t ) r n( t ) I !/(*)}• 

n n 

Note that ( [70] ) also holds under our dropping action, because 
it is equivalent to modifying the dynamics of H(t) to H m (t + 
1) = (^rriW-E n ( t ) & mnW) + +^(<). This is important, for 
it allows us to analyze the performance with the drift analysis. 

Using Lemma [3] we know that with probability 

Ps r Ps z , ||a* - a|| < Using C = 

2 max(5-U log(U) 2 , log(U) 2 ), we see that when V is 
large, we have: 

2 8 z Vf max d < ^ (71) 


Thus, 


P 


a* - ^C<a -C<a* - 


(72) 


where the inequality is taken entry-wise. This implies that with 
probability P$ r Pg z , for each n, 

d n (t) < max(U/3 + r max , a%* - C/ 2 ) 

L ^max = max(U 3 + r'max> 3 /max/t) C/ 2 )- 

The second inequality uses ( f 8 T| ) in Appendix F. We now carry 
out a similar argument as in the proof of Theorem [T] and 
conclude that: 


that a* = 0(V) by (81 1 , we conclude that 

liniT-*x> 5 - Y^J=o E{A a (t)} = 0(1/V). Moreover, since 
d(t) is stable, f n > j n . Finally, using ( |73| > - ( [75] ), we see that 
the fraction of time dropping happens, i.e., when the claimed 
reward does not count, is 0(1/U 2 ). Since (3 = 0(1), this 
results in an additional utility loss of ()(1 /V 2 ). Hence, we 
conclude that: 

G + 2N d max 5 r 


Y U n(fn) - C > A*, - 
n 

This completes the proof. 


V 


-0( 1/V). (76) 


Appendix E - Proof of Theorem 3

We prove Theorem 3 here.

Proof: (Theorem 3) To prove the result, we define a
different Lyapunov function as follows:

$L_0(t) = \frac{1}{2}\|(d(t), Q(t), H(t)) - \theta - \alpha^*\|^2$. (77)

Then, we define a one-slot conditional Lyapunov drift as
$\Delta_0(t) = E\{L_0(t+1) - L_0(t) \mid y(t)\}$. Using the queueing
dynamics, we obtain that:

A 0 (t) < G (78) 

- - (<?"(*) - 01 )) E {Pn(t) - R n (t) \ V (t )} 

n 

~Y^n - dn(t))E{r n {t) - 7 „(f) I y(t)} 

n 

- Y&t - - 0 2 ))E{ Y b mn (t) - h m (t ) I y(t)}. 

m n 


Ay(f) + A a (f) < G-2N5 r d max - g(d(t),Q(t),H(t)) 
< G - 2NS r d lnax - Vf; v . 


Using the fact that the last three components constitute the 
subgradient of g(a) at a = (d(t),Q(t),H(t)) ll27l . we 
obtain: 


Carrying out a telescoping sum and taking a limit as in 
Theorem [T]s proof, we obtain: 

T-l 


Y U n^n)-C>f: v - 


G + 2Nd max 5 r 
V 




T->oo T 


t=0 


It remains to show that all the queues are finite, and that 
limr-j-cx) t E{A a (f)} = 0(1/U). Then, we can con¬ 

clude r n > 7 n and completes the proof. 

To this end, we first use ( 681 and the definition of d(t), 
Q(t), and H(t), to obtain that: 

3 

Pr{d„(i) > -£ + D + i/} < ae 

Pr{Q„(f) > + D + v) < ae 

Pr{H m (t) > + D + v) < ae 

which are exactly the queueing probability bounds ( |46| , ( [47] ), 
and (48!. Using ( | 68 ] > again, we see that for a large V such that 

C A P> + /r max + Nb max + v max + 2 log(U)/A', 


-Ku 


-Ku 


-Ku 


^max) 

< 

a 

V 2 

(73) 

Pr^ Mmax} 

< 

a 

V 2 

(74) 

Pr {H m (t) < iV^max} 

< 

a 

V 2 ' 

(75) 

Combining the above bounds 

and 

Lemma 

|4j and 


Ao (i) < G-(j((d(t),Q(t),/r(i))-0)-j(a*)) 

< G — p\\(d(t),Q(t) 1 H(t)) — 0 — a* 11. 

(Therefore, for any 0 < eo < p, if \\(d(t),Q{t),H(t)) — 0 — 
a* 11 > ’ then the above implies that: 

E{||(d(t + 1), Q(t + 1), H(t+ 1)) - 6- a *|| 2 | y{f)} 

< (\\(d(t), Q(t), H(t)) — 6 — a*11 — e 0 ) 2 , 

which further implies that: 

E{||(d(t + l),Q(t+l),H(t + l))-0-a*|| | y(t)} 

< \\(d(t),Q(t),H(t))-d-a*\\-e 0 . 

Then, using the fact that a* = 0(V) ||20| , (d(T$ r + 
1), Q(T Sr + l),H(Ts r + 1)) = 0, and using Lemma 5 in 
ED, we conclude then: 

EjfLMM} = E | T ^ + 0(U)}. 

Here D[ = G/{p — eg) = 0(1) and E{T'“ AM } denotes the 
expected time to get to within D\ of a*. Using ( [ 66 ] ) in Lemma 
jij and by defining D\ A D[ + 2V "-^ ax ^ r , we conclude that: 
EjyLRAM} = E { T ^ + 0(U)}. 

This proves & To prove ( |5T| ), note that the main dif¬ 
ference between DRAM and LRAM is that DRAM utilizes the 
system state information to “jump start” the algorithm. Us¬ 
ing Lemma [3] again, we see that with probability Ps z Ps r , 
||(d(O),Q(O),ff(O))-0-a*|| <
result and (f79|>, we conclude that: 




Combing this implies that: 


E { T DRA M } < E | T; + 


2 W max t? 
pe o 


)}■ 


i^T and 


i=l,2,...,oo 
xfc 


such that: 

k\ 


E 




E 




< 


2Vf max Sr 

PV 


This proves (|5T| and completes the proof of the theorem. 


Appendix F - Proof of Lemma 3

We present the proof for Lemma 3 here. For notational
simplicity, we define $f_{\max} = N\beta r_{\max} + c_{\max}$.

Proof: (Lemma[3]) Since with probability Pg r Ps z , S r < e r 
and S z < e zi we have from Assumption [T] that there exists 
a set of actions and probabilities that guarantee ( |38) , ( [39} , 
and (40 1 . Also, since 0 < < E{E(£)} 

and 0 < < E { e m(*)}> it can be shown 

that there exists r/i = 0 ( 1 ) > 0 , such that for any subset 
I„ C A f and any subset I m C Ad, there exist a set of actions 


This proves ( [ 66 } . 

To prove ( |65} , note that for any a, 

g(o') - g{a) = “ K k )g k (a). 


Therefore, with probability P$„, we have: 

< \z\s z (vf m 


\ A ~ d* 




r n 




<*(A n 


(84) 


(85) 


ft max) 


+ J2<(Nb ma 

m 

< S z Vf m ax'*?, 

where ■& = \Z\(1 +r n 


- A n 


' ftn 


-Nb n 


<)/v 


E * k E X i^in = E E X iVn{z k , b-) - t7l„77l, (79) 

k i k i 

where ox„ = 1 if n £ I n and <rx„ = — 1 otherwise. Similarly, 

E ^ E A " E b "mn = E ** E X i h im - ^ 1 ,( 80 ) 

kin k i 

where <7x m = 1 if m £ I m and ox m = — 1 otherwise. Then, 
using Lemma 1 in fT31 , we see that a* obtained by solving 
m satisfies that: 

~ (81) 


and the last inequality follows from (81 1 . This then implies 
that: 

\g(a*)-g(a*)\<26 z Vf mBX 0, ( 86 ) 

for otherwise we have: 

g(a*) - g(a*) < g{a*) + S Z V / max t? - g(a*) + S Z V / max t? < 0, 

which contradicts with the fact that a* achieves the minimum 


of g(a). Using the polyhedral structure, we see that (651 
follows. This completes the proof of the lemma. ■ 


Here // = niinf r/o, q-\). Moreover, ( [81} also holds for a* and 
a*. Now we look at g(a*) and g(a*). For explanation, we 
write g(a*) = g(a*,b ,R ,h ), where 7 *,b ,R ,h 
are the optimal actions corresponding to ci* with r and the 
true distribution 7r. From the definition, we know that: 

g(a*,Y,b*,R*,h*) (82) 

(a) _ _ - 

> g(a*,-y, b, R, h) 

+ E 7r *EE*Mz fc ,M„(,*,6 )) - r n (z k , p n (z k ,b ))] 


> g(a.*,ff* ,b*, R*,h*) — Vf„ 

> g(a*,-y*, b*, R , h*) — Vf„ 


<.$r / 'H 
K&r/V 




- *Al 


E TT k E«f \fn{z k ? Pn(zki b )) T n (^Z k: [4 n (z k ,b ))] 


> g{a*,j*,b ,R ,h ) - 2Vf max S r /g. (83) 

Here 7 , b. R. h denote the optimal actions corresponding to 
a* in g(ot), and (a) follows from the definition of g(a) and the 
fact that g(a*) achieves the supremum over all actions. In (b), 
we have used the fact that g(a*) achieves the minimum over 
all a, that the learning module guarantees that ||f — r|| max < 


S r , and (81 1 . Finally, in (c), 7 * ,b , R ,h are the actions 
corresponding to a* under g(a) and it follows again because 
g(ct*) achieves the supremum. The last inequality follows 


similarly to (b). Using the polyhedral structure of g(a), (831 










